home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Power Programmierung
/
Power-Programmierung (Tewi)(1994).iso
/
assemblr
/
library
/
sampler0
/
soundexa.asm
< prev
next >
Wrap
Assembly Source File
|
1988-03-28
|
15KB
|
281 lines
TITLE SOUNDEXA
PAGE ,132
;Author : bp-programs, Kalamazoo, Michigan
;Date: Jan 88, for Clipper Summer 87
;Source Code protected by United States Copyright Law
;Permission given for code to be incorporated in other programs by author
;Syntax: SOUNDEXA(string[,filler])
;The soundex code is useful to look up names where you aren't sure of the
;spelling. Codes for similar sounding names are generally (but NOT always)
;close together. The code has the format LETTER-DIGIT-DIGIT-DIGIT. LETTER is
;simply the upper case first letter of the name. DIGITs are derived from the
;translation table below. Empty positions are NOT translated. If there are
;two or more letters with the same code following each other in the name, only
;ONE code number is used. 'Schmidt' is 'S530', not 'S253' or 'S533'. If there
;are more than three code numbers, the extra ones aren't used. If there are
;fewer, the code is padded with zeros. (But see about FILLER below).
;Soundex ABCDEFGHIJKLMNOPQRSTUVWXYZ
;Translation Table: 123 12 22455 12623 1 2 2
;SOUNDEXA is an assembly language implementation of the soundex code. It
;follows my interpretation of the algorithm found on pages 392/393 of Knuth's
;book 'Sorting and Searching', volume 3 of "The Art of Computer Programming".
;It does NOT return the same code as the soundex routine in examplec.c
;(SOUNDEXC) distributed with Clipper Summer 87 or the Rettig soundex routine
;distributed with Clipper Autumn 86 in extenddb.prg (SOUNDEXD).
;The main differences among the three implementations are listed below.
; SOUNDEXA SOUNDEXC SOUNDEXD
; ---------------------- --------------------- ----------------
;Format A999 A999 A9999
;Dupes Skips ltrs generating Skips identical ltrs Skips duplicate
; the same code which are adjacent in original code numbers even
; immediately adjacent in text if not adjacent in
; original text original text
;Null 1. Null string 1. Null string 1. Null string
;Returns 2. Completely non-alpha 2. Non-alpha/non
; string space characters
; except first char
;Fault 1. Ltrims leading non- 1. Does not trim, uses 1. Does not trim,
;Tolerance alpha characters non-alpha as lead uses any char
; 2. Skips intermediate 2. Aborts with non- 2. Skips inter-
; non-alpha characters alpha/non-space mediate non-
; except first char alpha chars
;Speed 3 secs/5000 repeats 9 secs/5000 repeats 90 secs/5000 repts
;I believe, of course, that SOUNDEXA is the 'best' implementation because
;it's closest to Knuth's algorithm, most fault tolerant, fastest (and also
;smallest, by the way) and the most FLEXIBLE. More about this below.
;Knuth's algorithm uses 0s (character zero) to fill trailing empty slots.
;This makes sense when you're constructing an index, such as
; INDEX ON SOUNDEXA(LASTNAME) TO NAMX
;However, when you're SEEK/LOCATEing with SOUNDEX you generally want to find
;all likely candidates and want to make sure that you don't miss any. You'd
;rather find a few wrong ones than miss a single right one. In that case
;you want to include even partial matches, such as
; LOCATE ALL FOR TRIM(SOUNDEXA(PART_NAME))
;SOUNDEXA allows you to select between two fillers, spaces or '0'. Even
;though zeros are 'standard', I find spaces more flexible and have made them
;the default. By specifying a second argument SOUNDEXA(LASTNAME,FILLER) once,
;you change the state of the routine. If FILLER is a '0' (as a character, not
;a number), all future calls to SOUNDEXA will use zeros for filling. If
;FILLER is any other character (or even a null string), SOUNDEXA will use
;spaces in the future. If there isn't a second argument, SOUNDEXA will use
;what you specified before or the default. If you prefer zeros as the default,
;change the FILLER DB to '0' in the DATASG.
;===================================================
EXTRN __PARINFO:FAR ;Clipper EXTEND routine, tells how many arguments
EXTRN __PARC:FAR ;Clipper EXTEND routine, gets a character argument
EXTRN __RETC:FAR ;Clipper EXTEND routine, returns a character value
SX_LENGTH EQU 4 ;Length of soundex code
DGROUP GROUP DATASG ;Ties this segment to the other data segments
;of Clipper. DS points to this DGROUP when
;we arrive in the assembly routine
DATASG SEGMENT WORD PUBLIC 'DATA' ;All PUBLIC segments with the name DATASG
;will be combined by the linker. All segments
;with the class 'DATA' will be adjacent to
;each other. WORD means that the segment
;starts on an even byte, which can sometimes
;be minutely faster in an 8086/80286 machine.
SOUNDEX DB SX_LENGTH DUP (?) ; Space for SOUNDEX result
DB 00 ; Terminator byte
;Strings in C and Clipper are terminated by a NULL (or NUL or
;NIL, it all means the same thing). There is no length byte
;or word as in BASIC or Turbo Pascal.
FILLER DB ' ' ; Filler byte for padding of SOUNDEX, can be
; space (default) or '0'
; Translate table from UC letters to SOUNDEX codes
; Omitted letters return NULL
; 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
; ' 123 12 22455 12623 1 2 2'
TRANSLATE db 0,'123',0,'12',0,0,'22455',0,'12623',0,'1',0,'2',0,'2'
DATASG ENDS
;==================================================
;==================================================
_PROG SEGMENT BYTE PUBLIC 'CODE' ;All PUBLIC segments with
;the name _PROG will be com-
;bined, all segments with the
;class 'CODE' will be
;adjacent. BYTE means that
;the segments will be aligned
;(stuck together) without any
;padding.
ASSUME CS:_PROG, DS:DGROUP, ES:NOTHING ;This is the way the segment
;registers are set up when we
;arrive here from Clipper.
PUBLIC SOUNDEXA ;Used in linking to Clipper, lets Clipper know
;where this routine is.
SOUNDEXA PROC FAR ;The name of our routine (procedure)
PUSH BP ;The Clipper extend documentation on disk
PUSH DI ;says that we have to save registers
PUSH SI ;BP, DI, SI, ES and DS. We are not
PUSH ES ;disturbing BP, so we may not have to save it.
PUSH DS ;But the Clipper routines __PARINFO, __PARC
;and __RETC may do so, we don't know.
;Ensure null string in case of missing argument or no letters
;We do this by moving a NULL byte in the first place of the
;SOUNDEX code. It will be overwritten if there's no error.
MOV BYTE PTR DGROUP:[SOUNDEX], 0
SUB AX, AX ;Faster and smaller than MOV AX, 0
PUSH AX
CALL __PARINFO ;Find out how many arguments passed
ADD SP, 2 ;Clean up stack. C routines, unlike BASIC
;or Pascal do NOT clean up the stack.
CMP AX, 1 ;Is there 1 argument?
JE MAIN_ROUT ;Yes, use stored filler for conversion
CMP AX, 2 ;Are there two arguments?
JNE LEAVE ;No, an invalid number of arguments. Leave.
;Two arguments, get the second one - a new FILLER
PUSH AX ;AX is always 2 here
CALL __PARC ;Get the address of FILLER string
ADD SP, 2 ;DX:AX hold pointer to string
MOV ES, DX
MOV BX, AX ;Use ES:BX to point to FILLER string
MOV AL, ' ' ;Load default space character
CMP BYTE PTR ES:[BX], '0' ;Is the new FILLER a '0'?
JNE SX010 ;No, all set with space
MOV AL, '0' ;Yes, make it a '0'
SX010: MOV DGROUP:FILLER, AL ;Set Filler character
MOV AX, 1
;AX is always 1 here, either set above or CMP AX, 1
;Pointer to string in DX:AX (SEG:OFS) for Clipper S87
MAIN_ROUT: PUSH AX
CALL __PARC ;Get pointer to string to convert
ADD SP, 2 ;Pointer to string returned in DX:AX (SEG:OFS)
;Set up pointer registers, seg and ofs
;DS:SI - String to convert, pointer incrementing
;ES:DI - SOUNDEX code in DGROUP, pointer incrementing
;ES:BX - TRANSLATE in DGROUP, points always to base
PUSH DS
POP ES ;ES now points to DGROUP
MOV DS, DX ;And DS to where ever Clipper stores
;its string arguments.
MOV SI, AX ;DS:SI point to start of string to convert
MOV DI, OFFSET DGROUP:SOUNDEX ;ES:DI point to start of
;SOUNDEX
MOV BX, OFFSET DGROUP:TRANSLATE ;ES:BX point to TRANSLATE base
CLD ;Work upward in string instructions
ASSUME DS:NOTHING, ES:DGROUP
;Let MASM know that we've switched seg regs around
MOV CX, SX_LENGTH ;Maximum SOUNDEX length
FIRST_LTR: LODSB ;Get start byte from string to convert
OR AL, AL ;At end of string to convert?
JZ LEAVE ;NULL string or no letters anywhere in
;string, return a NULL string
; Real chararacter here, but is it a letter?
AND AL, 0DFH ;This converts letters to upper case,
;destroys other characters. But since
;we don't care about those, it's ok.
MOV AH, AL ;Save the possible starting letter
SUB AL, 'A' ;Subtract the ASCII value of A which
;is 65. This makes A 0, B 1, C 2 etc.
JS FIRST_LTR ;Negative, so not a letter, try again
CMP AL, 'Z' - 'A' ;ASCII Z minus ASCII A is the largest
;real letter value.
JA FIRST_LTR ;Not a letter either, try again
;We found a valid UC starting letter. It's both in AH and AL
XLAT DGROUP:TRANSLATE ;Convert to SOUNDEX code 1-6 or NULL
;XLAT adds the value in AL to BX and
;fetches the character pointed to by
;(normally) DS:BX+AL. Since in this
;case DS points the NOTHING and ES to
;DGROUP, MASM is smart enough to make
;a segment override so that XLAT gets
;the byte at ES:BX+AL and puts it in
;AL. (Replacing the original pointer)
XCHG AH, AL ;After switch, AH holds code,
;AL the UC starting letter
STOSB ;Put first letter into SOUNDEX
LOOP DIGITS ;Decrement CX and jump to actual
;digit conversion. Skip over one
;piece of code.
ERR_DIGITS: SUB AH, AH ;Jump to here only when looping back
;and we want to clear out false
;'previous' letter matches if
;there are non-letters in between.
DIGITS: LODSB ;Get the next character
OR AL, AL ;ORing a value is the fastest way to
;find out if it's NULL (end of string)
JZ ALL_DONE ;Trailing NULL detected
;Not at end of string to convert
AND AL, 0DFH ;Convert to UC
SUB AL, 'A' ;Subtract ASCII 'A'
JS ERR_DIGITS ;Negative, not a letter
;Clear out previous code in AH
CMP AL, 'Z' - 'A'
JA ERR_DIGITS ;Not a letter either, clear previous
;code in AH
;Valid UC letter in AL, 'previous' code in AH
XLAT DGROUP:TRANSLATE ;Convert to SOUNDEX code or NULL
CMP AH, AL ;Same code as previous letter?
JE DIGITS ;Yes, duplicate, don't add to SOUNDEX
;New code, not a duplicate
MOV AH, AL ;Save it as the new 'previous' code
OR AL, AL ;Is it a real or a null code?
JZ DIGITS ;Null code, don't add to SOUNDEX
;Valid code in AL, not the same as previous, add to SOUNDEX
STOSB
LOOP DIGITS ;Continue until SX_LENGTH in SOUNDEX
ALL_DONE: JCXZ LEAVE ;Complete soundex, CX counted down
MOV AL, DGROUP:[FILLER] ; ' ' or '0'
REP STOSB ;Fill remainder of SOUNDEX with FILLER
LEAVE: POP DS ;Restore DGROUP segment into DS
;Clipper routines, such as __RETC
;expect DS to be pointing to DGROUP
PUSH DS ;Push segment of SOUNDEX string
MOV AX, OFFSET DGROUP:SOUNDEX
PUSH AX ;And push the offset of SOUNDEX
CALL __RETC ;Return pointer to SOUNDEX to Clipper
ADD SP, 4 ;Clean up stack
;DS already popped above
POP ES ;Get remainder of saved registers back
POP SI
POP DI
POP BP
RET ;Go back to Clipper
SOUNDEXA ENDP
_PROG ENDS
END